Transform FERC-714 load forecast table #3670

Merged

seeess1
Contributor

@seeess1 seeess1 commented Jun 14, 2024

Overview

NOTE: This is still a WIP. Just putting up a PR for now to make sure I'm on the right track and to get input.

Closes #3519.

What problem does this address?

This creates a core asset out of the raw_ferc714__yearly_planning_area_forecast_demand data.

What did you change?

(Still in progress)

Testing

(Still in progress)

How did you make sure this worked? How can a reviewer verify this?

  • I ran a local version of this and checked the dataframe output.

To-do list

@zaneselvans changed the title from "Add initial transformation framework" to "Transform FERC-714 load forecast table" Jun 17, 2024
@zaneselvans added the ferc714 (Anything having to do with FERC Form 714) and new-data (Requests for integration of new data) labels Jun 17, 2024
@seeess1
Contributor Author

seeess1 commented Jun 19, 2024

Results from make pytest-coverage:

============================== 85 passed, 4 skipped, 6 xfailed, 2 xpassed, 901 warnings in 7104.04s (1:58:24) ===============================
coverage report
Name                                                           Stmts   Miss   Cover
-----------------------------------------------------------------------------------
src/pudl/__init__.py                                              14      0 100.00%
src/pudl/analysis/__init__.py                                      1      0 100.00%
src/pudl/analysis/fuel_by_plant.py                                37      0 100.00%
src/pudl/analysis/ml_tools/__init__.py                             3      0 100.00%
src/pudl/analysis/ml_tools/experiment_tracking.py                 52      0 100.00%
src/pudl/analysis/record_linkage/__init__.py                       1      0 100.00%
src/pudl/analysis/record_linkage/eia_ferc1_model_config.py        25      0 100.00%
src/pudl/convert/__init__.py                                       1      0 100.00%
src/pudl/etl/eia_bulk_elec_assets.py                               9      0 100.00%
src/pudl/etl/epacems_assets.py                                    58      0 100.00%
src/pudl/etl/static_assets.py                                     26      0 100.00%
src/pudl/extract/__init__.py                                       1      0 100.00%
src/pudl/extract/csv.py                                           16      0 100.00%
src/pudl/extract/eia176.py                                        18      0 100.00%
src/pudl/extract/eia191.py                                        18      0 100.00%
src/pudl/extract/eia757a.py                                       18      0 100.00%
src/pudl/extract/eia860.py                                        38      0 100.00%
src/pudl/extract/eia860m.py                                       45      0 100.00%
src/pudl/extract/eia923.py                                        50      0 100.00%
src/pudl/extract/eia_bulk_elec.py                                 46      0 100.00%
src/pudl/extract/epacems.py                                       46      0 100.00%
src/pudl/extract/ferc.py                                           7      0 100.00%
src/pudl/extract/ferc714.py                                       22      0 100.00%
src/pudl/extract/gridpathratoolkit.py                             39      0 100.00%
src/pudl/extract/nrelatb.py                                       11      0 100.00%
src/pudl/extract/parquet.py                                       13      0 100.00%
src/pudl/ferc_to_sqlite/__init__.py                               20      0 100.00%
src/pudl/glue/__init__.py                                          1      0 100.00%
src/pudl/metadata/__init__.py                                      2      0 100.00%
src/pudl/metadata/codes.py                                         6      0 100.00%
src/pudl/metadata/constants.py                                    21      0 100.00%
src/pudl/metadata/dfs.py                                          11      0 100.00%
src/pudl/metadata/enums.py                                        36      0 100.00%
src/pudl/metadata/labels.py                                       12      0 100.00%
src/pudl/metadata/resources/__init__.py                           12      0 100.00%
src/pudl/metadata/resources/allocate_gen_fuel.py                   4      0 100.00%
src/pudl/metadata/resources/eia.py                                 4      0 100.00%
src/pudl/metadata/resources/eia860.py                              3      0 100.00%
src/pudl/metadata/resources/eia860m.py                             2      0 100.00%
src/pudl/metadata/resources/eia861.py                              2      0 100.00%
src/pudl/metadata/resources/eia923.py                              4      0 100.00%
src/pudl/metadata/resources/eia930.py                              2      0 100.00%
src/pudl/metadata/resources/eia_bulk_elec.py                       2      0 100.00%
src/pudl/metadata/resources/eiaaeo.py                              4      0 100.00%
src/pudl/metadata/resources/epacems.py                             3      0 100.00%
src/pudl/metadata/resources/ferc1.py                               4      0 100.00%
src/pudl/metadata/resources/ferc1_eia_record_linkage.py            3      0 100.00%
src/pudl/metadata/resources/ferc714.py                             3      0 100.00%
src/pudl/metadata/resources/glue.py                                3      0 100.00%
src/pudl/metadata/resources/gridpathratoolkit.py                   2      0 100.00%
src/pudl/metadata/resources/mcoe.py                                4      0 100.00%
src/pudl/metadata/resources/nrelatb.py                             2      0 100.00%
src/pudl/metadata/resources/pudl.py                                3      0 100.00%
src/pudl/metadata/sources.py                                       5      0 100.00%
src/pudl/output/__init__.py                                        1      0 100.00%
src/pudl/output/eia860.py                                         19      0 100.00%
src/pudl/output/eia_bulk_elec.py                                  14      0 100.00%
src/pudl/output/sql/helpers.py                                    12      0 100.00%
src/pudl/resources.py                                             20      0 100.00%
src/pudl/transform/__init__.py                                     1      0 100.00%
src/pudl/transform/eia930.py                                      19      0 100.00%
src/pudl/transform/eia_bulk_elec.py                               26      0 100.00%
src/pudl/transform/ferc714.py                                     87      0 100.00%
src/pudl/transform/params/__init__.py                              1      0 100.00%
src/pudl/transform/params/ferc1.py                                67      0 100.00%
src/pudl/workspace/__init__.py                                     1      0 100.00%
test/integration/console_scripts_test.py                          23      0 100.00%
test/integration/datasette_metadata_test.py                       25      0 100.00%
test/integration/etl_test.py                                      84      0 100.00%
test/integration/ferc1_eia_train_test.py                          30      0 100.00%
test/integration/ferc_dbf_extract_test.py                         38      0 100.00%
test/integration/zenodo_datapackage_test.py                       11      0 100.00%
test/unit/analysis/ml_tools_test.py                               20      0 100.00%
test/unit/analysis/plant_parts_eia_test.py                        48      0 100.00%
test/unit/analysis/spatial_test.py                                67      0 100.00%
test/unit/analysis/state_demand_test.py                            8      0 100.00%
test/unit/analysis/timeseries_cleaning_test.py                    32      0 100.00%
test/unit/console_scripts_test.py                                 10      0 100.00%
test/unit/extract/csv_test.py                                     48      0 100.00%
test/unit/extract/eia_bulk_elec_test.py                           42      0 100.00%
test/unit/extract/excel_test.py                                   34      0 100.00%
test/unit/extract/extractor_test.py                               19      0 100.00%
test/unit/extract/xbrl_test.py                                    49      0 100.00%
test/unit/helpers_test.py                                        162      0 100.00%
test/unit/output/epacems_test.py                                   7      0 100.00%
test/unit/transform/classes_test.py                              217      0 100.00%
test/unit/transform/eia923_test.py                                12      0 100.00%
test/unit/transform/eia_bulk_elec_test.py                         18      0 100.00%
test/unit/transform/eiaaeo_test.py                                11      0 100.00%
test/unit/transform/epacems_test.py                                8      0 100.00%
test/unit/transform/ferc1_test.py                                218      0 100.00%
test/unit/transform/glue_test.py                                  13      0 100.00%
test/unit/workspace/datastore_test.py                            114      0 100.00%
test/unit/workspace/resource_cache_test.py                       140      0 100.00%
src/pudl/analysis/ml_tools/models.py                              38      1  97.37%
src/pudl/analysis/record_linkage/eia_ferc1_inputs.py              71      1  98.59%
src/pudl/extract/eia861.py                                        37      1  97.30%
src/pudl/extract/eia930.py                                        25      1  96.00%
src/pudl/extract/ferc1.py                                        113      1  99.12%
src/pudl/extract/xbrl.py                                          46      1  97.83%
src/pudl/metadata/fields.py                                       29      1  96.55%
src/pudl/transform/eia860m.py                                     23      1  95.65%
test/integration/epacems_test.py                                  39      1  97.44%
test/integration/output_test.py                                   63      1  98.41%
test/unit/output/ferc1_test.py                                   168      1  99.40%
test/unit/settings_test.py                                       164      1  99.39%
src/pudl/__main__.py                                               2      2   0.00%
src/pudl/analysis/mcoe.py                                         78      2  97.44%
src/pudl/analysis/record_linkage/classify_plants_ferc1.py         32      2  93.75%
src/pudl/etl/glue_assets.py                                      112      2  98.21%
src/pudl/glue/ferc1_eia.py                                       135      2  98.52%
src/pudl/output/censusdp1tract.py                                 29      2  93.10%
src/pudl/output/eia923.py                                         93      2  97.85%
src/pudl/transform/gridpathratoolkit.py                           32      2  93.75%
test/unit/analysis/allocate_gen_fuel_test.py                      96      2  97.92%
test/unit/harvest_test.py                                         60      2  96.67%
test/unit/metadata_test.py                                        69      2  97.10%
src/pudl/logging_helpers.py                                       15      3  80.00%
src/pudl/output/epacems.py                                        29      3  89.66%
src/pudl/transform/epacems.py                                     34      3  91.18%
src/pudl/workspace/setup.py                                       47      3  93.62%
test/integration/record_linkage_test.py                           73      3  95.89%
src/pudl/convert/censusdp1tract_to_sqlite.py                      35      4  88.57%
src/pudl/metadata/helpers.py                                     182      4  97.80%
test/integration/glue_test.py                                     42      4  90.48%
test/unit/glue.py                                                  7      4  42.86%
src/pudl/extract/excel.py                                         75      5  93.33%
src/pudl/extract/ferc2.py                                         21      5  76.19%
src/pudl/extract/ferc6.py                                         15      5  66.67%
src/pudl/extract/ferc60.py                                        15      5  66.67%
src/pudl/extract/phmsagas.py                                      25      5  80.00%
src/pudl/output/ferc714.py                                       151      5  96.69%
test/integration/jupyter_notebooks_test.py                        13      5  61.54%
src/pudl/etl/__init__.py                                          57      6  89.47%
src/pudl/settings.py                                             373      6  98.39%
test/unit/conftest.py                                             22      6  72.73%
src/pudl/convert/metadata_to_rst.py                               21      7  66.67%
src/pudl/extract/extractor.py                                    149      7  95.30%
src/pudl/transform/nrelatb.py                                    106      7  93.40%
src/pudl/analysis/record_linkage/eia_ferc1_record_linkage.py     188      8  95.74%
src/pudl/analysis/spatial.py                                     112      8  92.86%
src/pudl/analysis/plant_parts_eia.py                             260      9  96.54%
src/pudl/transform/eia.py                                        309      9  97.09%
src/pudl/transform/eia861.py                                     409     10  97.56%
src/pudl/workspace/resource_cache.py                             122     10  91.80%
test/unit/io_managers_test.py                                    221     10  95.48%
src/pudl/etl/check_foreign_keys.py                                63     11  82.54%
src/pudl/output/pudltabl.py                                       98     11  88.78%
src/pudl/analysis/epacamd_eia.py                                  18     12  33.33%
src/pudl/analysis/record_linkage/link_cross_year.py              116     12  89.66%
src/pudl/extract/eiaaeo.py                                       139     12  91.37%
src/pudl/transform/eiaaeo.py                                     136     12  91.18%
src/pudl/analysis/allocate_gen_fuel.py                           323     14  95.67%
src/pudl/analysis/record_linkage/name_cleaner.py                  93     14  84.95%
src/pudl/analysis/state_demand.py                                146     15  89.73%
src/pudl/ferc_to_sqlite/cli.py                                    43     21  51.16%
src/pudl/transform/classes.py                                    394     21  94.67%
src/pudl/etl/cli.py                                               47     22  53.19%
src/pudl/workspace/datastore.py                                  311     22  92.93%
src/pudl/transform/eia860.py                                     254     23  90.94%
src/pudl/extract/dbf.py                                          264     25  90.53%
src/pudl/io_managers.py                                          283     25  91.17%
src/pudl/analysis/record_linkage/embed_dataframe.py              175     29  83.43%
src/pudl/analysis/service_territory.py                           121     31  74.38%
src/pudl/transform/eia923.py                                     281     38  86.48%
src/pudl/analysis/timeseries_cleaning.py                         460     44  90.43%
src/pudl/helpers.py                                              465     52  88.82%
src/pudl/transform/ferc1.py                                     1593     61  96.17%
src/pudl/analysis/record_linkage/eia_ferc1_train.py              172     71  58.72%
src/pudl/output/eia.py                                           176     72  59.09%
src/pudl/output/ferc1.py                                         754     96  87.27%
src/pudl/metadata/classes.py                                     882     99  88.78%
-----------------------------------------------------------------------------------
TOTAL                                                          15060   1068  92.91%

@seeess1
Contributor Author

seeess1 commented Jun 19, 2024

@zaneselvans Been looking around at the tests in the repo. Correct me if I'm wrong but seems like no new unit tests needed here. My changes just rely on schema enforcement, which seems to already be tested in /test/unit/metadata_test.py.

But I'm wondering about adding a validation script for FERC 714 since we don't have one yet in /test/validate. I could try adding a simple test_minmax_rows function in there. Then again, I don't want to unnecessarily duplicate code.

Member

@zaneselvans zaneselvans left a comment

Thanks for putting this together! I made some minor naming & Dagster infrastructure comments.

Have you done any plotting or other investigation that might highlight data quality issues? For a visual sanity check on the data, would you be up for using it to recreate this plot from RMI out of this article?

image

Comment on lines 547 to 550
io_manager_key="parquet_io_manager",
op_tags={"memory-use": "high"}, # Should this be high?
compute_kind="pandas",
)
Member

  • I believe this asset should use the pudl_io_manager -- in general we send all output to both SQLite and Parquet. Only the hourly tables (which can run into the hundreds of millions of rows) are written to Parquet alone.
  • I don't think this asset will be high memory -- with just annual data and 10 years of forecasts for each planning area it'll be a pretty small table.

Contributor Author

Making these updates now.

In terms of when to use pudl_io_manager vs parquet_io_manager, is it basically a judgment call/depends on how the dev work is going? As in, if the transformation we're trying to run is going to output data of at least a certain size, then we go with parquet? Similar question for tags like memory-use - it's sort of on a case-by-case basis?

Member

Right now we use pudl_io_manager for everything with the exception of the hourly time series, which can be quite long. For hourly tables we use parquet_io_manager. Not all of the hourly tables really need to be Parquet-only, but enough of them do that it seemed simpler to make this clear distinction.

"High" memory use is any transform that has a peak memory usage that's higher than one CPU's fair share of the memory on the nightly build machine (16 CPUs and 64GB of memory, so anything larger than 4GB). Identifying which assets fall into this category is haphazard right now -- I basically ran all of the ones I thought might be high memory and watched resource consumption on my laptop. I'm sure we could do some real profiling and get a more comprehensive answer.
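As a rough illustration of that rule of thumb, the threshold can be derived from the build machine specs quoted above (64GB across 16 CPUs). This is only a sketch: `fair_share_gb` and `memory_tag` are hypothetical helper names, not part of PUDL, and the peak memory figure would have to come from your own profiling.

```python
def fair_share_gb(total_memory_gb: float, n_cpus: int) -> float:
    """Memory available to each CPU if it is split evenly."""
    return total_memory_gb / n_cpus


def memory_tag(
    peak_memory_gb: float, total_memory_gb: float = 64, n_cpus: int = 16
) -> dict[str, str]:
    """Return op_tags marking an asset as high memory only above the fair share."""
    if peak_memory_gb > fair_share_gb(total_memory_gb, n_cpus):
        return {"memory-use": "high"}
    return {}


print(fair_share_gb(64, 16))  # 4.0
print(memory_tag(0.5))        # {}
print(memory_tag(6))          # {'memory-use': 'high'}
```

A small annual table like this one would fall well under the 4GB line, so under this sketch it would get no memory tag at all.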

@zaneselvans
Member

zaneselvans commented Jun 19, 2024

Been looking around at the tests in the repo. Correct me if I'm wrong but seems like no new unit tests needed here. My changes just rely on schema enforcement, which seems to already be tested in /test/unit/metadata_test.py.

I think this is correct. We rely pretty heavily on running a "fast" ETL for integration testing, which I think should exercise your new code, since the FERC-714 module is already being scanned for assets by Dagster.

But I'm wondering about adding a validation script for FERC 714 since we don't have one yet in /test/validate. I could try adding a simple test_minmax_rows function in there. Then again, I don't want to unnecessarily duplicate code.

We are just starting the process of converting the validation tests into Dagster asset checks, so that they run during the ETL instead of infrequently and after the fact. @jdangerx has prototyped this kind of check in pudl.transform.eiaaeo if you want to take a look there for an example of how to integrate per-year expected row count checks.
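To make the idea of per-year expected row counts concrete, here is a minimal sketch in plain Python. It is not the Dagster asset check prototyped in pudl.transform.eiaaeo (that code is not shown in this thread); `row_count_errors`, the `report_year` key, and the counts below are all made up for illustration.

```python
from collections import Counter


def row_count_errors(rows, expected_counts, year_key="report_year"):
    """Compare observed rows per year with expected counts.

    Years that are absent from the data are skipped, so a partial
    ("fast") ETL run that only processes some years can still pass.
    """
    observed = Counter(row[year_key] for row in rows)
    errors = []
    for year, n_observed in sorted(observed.items()):
        n_expected = expected_counts.get(year)
        if n_expected is None:
            errors.append(f"{year}: unexpected year with {n_observed} rows")
        elif n_observed != n_expected:
            errors.append(f"{year}: expected {n_expected} rows, found {n_observed}")
    return errors


# Toy data: three rows for 2020, two rows for 2021, nothing for 2022.
rows = [{"report_year": 2020}] * 3 + [{"report_year": 2021}] * 2
expected = {2020: 3, 2021: 4, 2022: 5}
print(row_count_errors(rows, expected))  # ['2021: expected 4 rows, found 2']
```

In an asset check, an empty list would correspond to a passing result and a non-empty list to a failure, with the messages reported as metadata.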

@seeess1
Contributor Author

seeess1 commented Jun 23, 2024

Also yes let me run some sanity checks and viz. And I’ll take a look at the Dagster asset checks you mentioned.

@seeess1
Contributor Author

seeess1 commented Jul 10, 2024

Been looking around at the tests in the repo. Correct me if I'm wrong but seems like no new unit tests needed here. My changes just rely on schema enforcement, which seems to already be tested in /test/unit/metadata_test.py.

I think this is correct. We rely pretty heavily on running a "fast" ETL for integration testing, which I think should exercise your new code, since the FERC-714 module is already being scanned for assets by Dagster.

But I'm wondering about adding a validation script for FERC 714 since we don't have one yet in /test/validate. I could try adding a simple test_minmax_rows function in there. Then again, I don't want to unnecessarily duplicate code.

We are just starting the process of converting the validation tests into Dagster asset checks, so that they run during the ETL instead of infrequently and after the fact. @jdangerx has prototyped this kind of check in pudl.transform.eiaaeo if you want to take a look there for an example of how to integrate per-year expected row count checks.

Added some row count checks in my latest commit.

@seeess1
Contributor Author

seeess1 commented Jul 13, 2024

Just caught an issue that I'm handling now


@seeess1
Contributor Author

seeess1 commented Jul 18, 2024

@zaneselvans I've added some initial analysis in a notebook here. Seeing a couple interesting things so far. Let me know what you think.

One thing that's a bit concerning is that not all respondents provided 10 years' worth of forecasts in every report year. It only happened in a few instances:

image

Not sure if we want to do something specific about this but wanted to flag it with you in case.
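For anyone who wants to reproduce this kind of check, the sketch below counts distinct forecast years per respondent and report year. It uses plain Python rather than the notebook's code, and the `respondent_id`, `report_year`, and `forecast_year` keys are assumed names that may not match the actual columns.

```python
from collections import defaultdict


def short_forecasts(rows, expected_years=10):
    """Find (respondent, report year) pairs with fewer forecast years than expected."""
    years_seen = defaultdict(set)
    for row in rows:
        key = (row["respondent_id"], row["report_year"])
        years_seen[key].add(row["forecast_year"])
    return {
        key: len(years)
        for key, years in years_seen.items()
        if len(years) < expected_years
    }


# Toy data: respondent 1 forecasts 10 years ahead, respondent 2 only 5.
rows = [
    {"respondent_id": 1, "report_year": 2020, "forecast_year": year}
    for year in range(2021, 2031)
] + [
    {"respondent_id": 2, "report_year": 2020, "forecast_year": year}
    for year in range(2021, 2026)
]
print(short_forecasts(rows))  # {(2, 2020): 5}
```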

@zaneselvans
Member

zaneselvans commented Jul 18, 2024

Huh, that's weird that they don't always report 10 years of data, but pretty rare and also not crazy in the context of how bad a lot of the FERC data is. I wouldn't worry about it for the moment.

It looks like there's an alembic migration error now, probably due to the newly merged-in changes from main. Are you familiar with alembic? Often what we do in this situation is downgrade the local DB by one revision, merge the changes in from main, run alembic upgrade head, and finally create a new migration with the new changes from the branch. It's been a while since I had to do this though. Maybe @e-belfer or @zschira remember the exact incantations.

I'm curious if you tried either of the asset loading methods I had suggested, and if they worked or didn't work.

# Option 1: load the asset value through Dagster
from dagster import AssetKey
from pudl.etl import defs

demand_forecasts = defs.load_asset_value(AssetKey("core_ferc714__yearly_planning_area_demand_forecast"))

# Option 2: read the Parquet output directly
import pandas as pd
from pudl.workspace.setup import PudlPaths

pudl_paths = PudlPaths()

demand_forecasts = pd.read_parquet(
    pudl_paths.parquet_path("core_ferc714__yearly_planning_area_demand_forecast"),
    dtype_backend="pyarrow",
)

@zschira
Member

zschira commented Jul 18, 2024

@seeess1 and @zaneselvans I took a look at the alembic issue and I think it should be a pretty easy fix. You should be able to use the following commands to satisfy alembic:

alembic merge heads
alembic upgrade head

Then, you can git add migrations/ and commit this to your branch and that should fix the issue.

@seeess1
Contributor Author

seeess1 commented Jul 18, 2024

Thanks a ton @zschira! Just ran the commands you shared and pushed up the changes. Will keep an eye on the latest build to see if the errors go away.

@zaneselvans I tried loading the new asset other ways, including the ones you posted above, but none are working for me :( . Here's what I get when I try the first one:

DagsterResourceFunctionError: Error executing resource_fn on ResourceDefinition pudl_io_manager

And here's what I get when I try the second one:

Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
pudl_output
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing

Maybe there's something very simple that I need to do but not sure.

Re some respondents not reporting all 10 years of data: sounds good. I'll keep the data as-is. In the meantime, I'll play around a little more with the data we have to try and compare it with the actual usage for these respondents.

@zaneselvans
Member

Hmm, that seems odd. I'd really like to figure out what's up so we can help other folks avoid it. A few questions:

  • How are you running your notebook? Are you running jupyterlab locally? Using VS Code? Using the Jupyter desktop app?
  • Are you running the notebook from within the pudl-dev conda environment? (activated pudl-dev before running jupyterlab, have the pudl-dev environment selected for the notebook pane in VS Code, etc.)
  • When was the last time you rebuilt the pudl-dev environment?
  • What do you get from the following commands:
mamba list | grep catalystcoop
echo $DAGSTER_HOME
echo $PUDL_OUTPUT

If you run python at the command line with the pudl-dev environment activated rather than in a notebook do either of the access options above work?
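The same environment check can be run from inside the notebook kernel, which may not inherit the shell's variables. The sketch below assumes (without confirmation in this thread) that the "Field required" error for pudl_output corresponds to PUDL_OUTPUT being unset in that process; `missing_env_vars` is a hypothetical helper, not a PUDL function.

```python
import os


def missing_env_vars(names=("PUDL_OUTPUT", "DAGSTER_HOME"), environ=None):
    """Return the names of environment variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Run this in the same process (notebook kernel or python shell) that fails.
missing = missing_env_vars()
if missing:
    print("Not set in this Python process:", ", ".join(missing))
else:
    print("All expected environment variables are set.")
```

If a variable shows up as missing here but `echo` prints it in the terminal, the notebook was probably launched from an environment that did not have it set.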

@zaneselvans
Member

Ah, I just tried running the integration tests locally and they failed because the assertion about duplicate rows is only true when you process all years of data. Probably want to set an upper bound on dupes rather than an exact number.

E       AssertionError: Expected 20 duplicates removed, but found 0
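A minimal sketch of the upper-bound version, in plain Python rather than pandas and not the code that ended up in the PR: `drop_duplicates_bounded` is a hypothetical name, and the default of 20 is taken from the error message above.

```python
def drop_duplicates_bounded(rows, max_duplicates=20):
    """Drop exact duplicate rows, failing only if more than max_duplicates are found."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    n_removed = len(rows) - len(unique)
    assert n_removed <= max_duplicates, (
        f"Expected at most {max_duplicates} duplicates removed, but found {n_removed}"
    )
    return unique


# A partial run with no duplicates passes, as does one with a few duplicates.
partial = [(1, 2020, 2021, 100.0), (2, 2020, 2021, 50.0)]
full = partial + [(1, 2020, 2021, 100.0)]
print(len(drop_duplicates_bounded(partial)))  # 2
print(len(drop_duplicates_bounded(full)))     # 2
```

With an upper bound, the fast ETL (which sees only some years and may find zero duplicates) and the full ETL can share the same assertion.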

@seeess1
Contributor Author

seeess1 commented Jul 19, 2024

Ah, I just tried running the integration tests locally and they failed because the assertion about duplicate rows is only true when you process all years of data. Probably want to set an upper bound on dupes rather than an exact number.

E       AssertionError: Expected 20 duplicates removed, but found 0

Good catch. Updated here:

8f5f447

@zaneselvans
Member

Okay that looks about like what I would expect to see in VS Code. If it's running on Python 3.12.3 instead of 3.12.4 it's probably quite stale though. We update the dependency lockfile every Monday, so the first thing I would try is rebuilding the environment and restarting VS Code.

@zaneselvans zaneselvans added this pull request to the merge queue Jul 20, 2024
Merged via the queue into catalyst-cooperative:main with commit ee975b2 Jul 20, 2024
7 checks passed
@zaneselvans
Member

I disabled the build-release check since it very rarely fails, and we need contributors to be able to use the merge queue, and got this merged in.

Labels
ferc714: Anything having to do with FERC Form 714
new-data: Requests for integration of new data.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Transform FERC-714 load forecast table
3 participants